Abstract: Computing the partition function and the
marginals of a global probability distribution are two
important issues in any probabilistic inference problem.
In a previous work, we presented sub-tree based upper 
and lower bounds on the partition function of a given 
probabilistic inference problem. Using the entropies of
the sub-trees we proved an inequality that compares the
lower bounds obtained from different sub-trees. In this
paper we investigate the properties of one specific lower 
bound, namely the lower bound computed by the minimum 
entropy sub-tree. We also investigate the relationship 
between the minimum entropy sub-tree and the sub-tree 
that gives the best lower bound. 

I. Introduction 

The partition function is of great importance in statis- 
tical physics since most of the thermodynamic variables 
of a system can be expressed in terms of this quantity or 
its derivatives. This quantity also plays an important role 
in many other contexts, including artificial intelligence, 
combinatorial enumeration, approximate inference, and 
parameter estimation. In general, the exact calculation
of the partition function is computationally intractable;
therefore, finding low-complexity estimates and bounds
is desirable.

In [5], we proposed upper and lower bounds on the
partition function that depend on the partition function
of any sub-junction tree of a given junction graph
representing the inference problem. In [6], a greedy algorithm
that gives low-complexity upper and lower bounds on
the partition function was proposed. An inequality was
proved that compares the lower bounds calculated from
different sub-junction trees based on their entropies [6,
Theorem 2].

In this paper, we study the properties of the minimum
entropy sub-junction tree and extend the results
of [6, Theorem 2] by stating new theorems and corollaries.
We prove that there is an upper bound on how
much any other lower bound can be better than the one
obtained from the minimum entropy sub-tree. We also
show that the probability distributions over the sub-tree
that gives the best lower bound and the one with the
minimum entropy are close in divergence.

II. Background 

Suppose a global function defined over several random 
variables, e.g. a probability mass function, factors as a 
product of a series of non-negative local kernels, each 
kernel defined over a subset of the set of all random 
variables. The goal is to compute the normalization con- 
stant and the marginals of the global function according 
to those subsets. 

More formally, consider a set $\{X_1, X_2, \ldots, X_N\}$ of
$N$ discrete random variables taking their values in a
finite set $A = \{0, 1, 2, \ldots, a-1\}$. Let $x_i$ represent
the possible realizations of $X_i$ and let $\mathbf{x}$ stand for
$\{x_1, x_2, x_3, \ldots, x_N\}$. Suppose $R_1, R_2, \ldots, R_M$ are
subsets of $\{1, 2, \ldots, N\}$ and $\mathcal{R} = \{R_1, R_2, \ldots, R_M\}$ is
a collection of subsets of the indices of the random
variables $X_1$ through $X_N$. Let us also suppose that $p(\mathbf{x})$,
the joint probability mass function, factors into a product
of finite and non-negative local kernels as

$$p(\mathbf{x}) = \frac{1}{Z} \prod_{R \in \mathcal{R}} \alpha_R(\mathbf{x}_R), \tag{1}$$

where each local kernel $\alpha_R(\mathbf{x}_R)$ is a function of the
variables whose indices appear in $R$, and $Z$ is the
partition function, also known as the global normalization
constant, whose role is simply to normalize the
probability distribution.
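As an illustration of the factorization above, the following minimal Python sketch computes $Z$ by brute force for a toy model. The three binary variables, the index subsets, and the kernel values are hypothetical choices made only for this example; they do not come from the paper.

```python
import itertools
import math

# Hypothetical toy model: three binary variables on a single cycle.
# Each key is an index subset R; each value is a positive kernel alpha_R.
kernels = {
    (0, 1): lambda a, b: 2.0 if a == b else 1.0,
    (1, 2): lambda a, b: 3.0 if a == b else 1.0,
    (0, 2): lambda a, b: 1.0 if a == b else 2.0,
}

states = list(itertools.product((0, 1), repeat=3))

def unnormalized(x):
    """Product of the local kernels alpha_R(x_R) at configuration x."""
    return math.prod(k(*(x[i] for i in R)) for R, k in kernels.items())

# The partition function is a sum over all a**N configurations.
Z = sum(unnormalized(x) for x in states)

# Normalizing by Z gives the global probability mass function p(x).
p = {x: unnormalized(x) / Z for x in states}
```

The sum has $a^N$ terms ($2^3 = 8$ here), which is what makes this direct approach impractical for large $N$.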

In a probabilistic inference problem, we are interested
in computing $Z$ and the marginal densities $p_R(\mathbf{x}_R)$,
which are defined as

$$p_R(\mathbf{x}_R) = \sum_{\mathbf{x} \setminus \mathbf{x}_R} p(\mathbf{x}).$$

III. Graphical Models and the Generalized
Distributive Law

Graphical models use graphs to represent and manip- 
ulate joint probability distributions. An efficient way to 
solve a probabilistic inference problem is to represent 
it with a graphical model and use a message passing 
algorithm on this model. 

There are many graphical models in the literature such 
as junction graphs, Markov random fields, and (Forney- 
style) factor graphs. In this paper we focus on graphical 
models defined in terms of junction graphs. Our results 
can be easily expressed with other graphical models. 

Definition 1: A junction graph is an undirected graph
$\mathcal{G} = (V, E, L)$ where each vertex and each edge have
labels, denoted by $L(v)$ and $L(e)$ respectively. The
label on each edge must be a subset of the labels of
its corresponding vertices. Furthermore, the induced
subgraph consisting only of the vertices and edges which
contain a particular label must be a tree.$^1$

We say that $\mathcal{G} = (V, E, L)$ is a junction
graph for the inference problem defined by $\mathcal{R}$ if
$\{L(v_1), L(v_2), \ldots, L(v_M)\} = \mathcal{R}$. For any probabilistic
inference problem a junction graph representation always exists.
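The two conditions in Definition 1 can be checked mechanically. The sketch below is one possible way to do so; the function name `is_junction_graph`, the dictionary-based graph encoding, and the example graph are assumptions made for illustration only.

```python
def is_junction_graph(vertex_labels, edge_labels):
    """Check Definition 1.

    vertex_labels: dict mapping vertex -> frozenset of variable indices.
    edge_labels:   dict mapping (u, v) -> frozenset of variable indices.
    """
    # Each edge label must be a subset of the labels of both endpoints.
    for (u, v), lab in edge_labels.items():
        if not (lab <= vertex_labels[u] and lab <= vertex_labels[v]):
            return False

    all_labels = set().union(*vertex_labels.values())
    for i in all_labels:
        verts = {v for v, lab in vertex_labels.items() if i in lab}
        edges = [e for e, lab in edge_labels.items() if i in lab]
        # A tree on |V_i| vertices has exactly |V_i| - 1 edges ...
        if len(edges) != len(verts) - 1:
            return False
        # ... and is connected (checked here by depth-first search).
        adj = {v: set() for v in verts}
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        seen, stack = set(), [next(iter(verts))]
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(adj[v] - seen)
        if seen != verts:
            return False
    return True

# Hypothetical example: a single cycle over three pairwise subsets.
vertex_labels = {"a": frozenset({0, 1}), "b": frozenset({1, 2}), "c": frozenset({0, 2})}
edge_labels = {("a", "b"): frozenset({1}), ("b", "c"): frozenset({2}), ("a", "c"): frozenset({0})}
ok = is_junction_graph(vertex_labels, edge_labels)
```

Removing the edge between `a` and `c` in this example leaves label 0 on two disconnected vertices, so the check fails.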

The generalized distributive law (GDL) is an iterative
message passing algorithm, described by its messages
and beliefs, to solve the probabilistic inference problem
on a junction graph. It operates by passing messages
along the edges of a junction graph, see [1], [9].

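The message sent from a vertex $u$ to another vertex
$v$ is a function of the variables whose indices are
on $e$, the edge between $v$ and $u$, and is denoted by
$m_{u,v}(\mathbf{x}_{L(u,v)})$. The beliefs on vertices and edges are
denoted by $b_v(\mathbf{x}_{L(v)})$ and $b_e(\mathbf{x}_{L(e)})$, respectively. The
messages and the beliefs are computed as

$$m_{u,v}(\mathbf{x}_{L(u,v)}) = \sum_{\mathbf{x}_{L(u) \setminus L(u,v)}} \alpha_{L(u)}(\mathbf{x}_{L(u)}) \prod_{u' \in N(u) \setminus v} m_{u',u}(\mathbf{x}_{L(u',u)}),$$

$$b_v(\mathbf{x}_{L(v)}) = \frac{1}{Z_v} \, \alpha_{L(v)}(\mathbf{x}_{L(v)}) \prod_{u \in N(v)} m_{u,v}(\mathbf{x}_{L(u,v)}),$$

$$b_e(\mathbf{x}_{L(e)}) = \frac{1}{Z_e} \, m_{u,v}(\mathbf{x}_{L(e)}) \, m_{v,u}(\mathbf{x}_{L(e)}),$$

where $N(v)$ denotes the neighbors of $v$, and $Z_v$ and $Z_e$ are
the local normalizing constants.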

Theorem 1: On a junction tree the beliefs converge
to the exact local marginal probabilities after a finite
number of steps [1, Theorem 3.1].

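If $\mathcal{G}$ is a tree, $p(\mathbf{x})$ defined by (1) factors as follows

$$p(\mathbf{x}) = \frac{\prod_{v \in V} p_v(\mathbf{x}_{L(v)})}{\prod_{e \in E} p_e(\mathbf{x}_{L(e)})}. \tag{4}$$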

In this case, the entropy of the global distribution 
decomposes as the sum of the entropies of the vertices 
minus the sum of the entropies on the edges. 
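The entropy decomposition can be verified numerically on a small junction tree. The sketch below uses a hypothetical chain with vertex labels {0,1} and {1,2} and a single edge labelled {1}; the kernel values are arbitrary and the helper names are illustrative.

```python
import itertools
import math

# Hypothetical kernels on the two vertices of a chain junction tree.
a01 = lambda x0, x1: 2.0 if x0 == x1 else 1.0
a12 = lambda x1, x2: 3.0 if x1 == x2 else 1.0

states = list(itertools.product((0, 1), repeat=3))
weights = {x: a01(x[0], x[1]) * a12(x[1], x[2]) for x in states}
Z = sum(weights.values())
p = {x: w / Z for x, w in weights.items()}

def marginal(dist, idx):
    """Marginalize a distribution over full configurations onto the indices idx."""
    out = {}
    for x, px in dist.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + px
    return out

def entropy(dist):
    """Shannon entropy in nats."""
    return -sum(v * math.log(v) for v in dist.values() if v > 0)

H_global = entropy(p)
# Sum of vertex entropies minus the entropy on the single edge.
H_tree = (entropy(marginal(p, (0, 1)))
          + entropy(marginal(p, (1, 2)))
          - entropy(marginal(p, (1,))))
```

On this tree `H_global` and `H_tree` agree up to floating-point error, so the global entropy is obtained from local marginals only.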

Similarly, the global normalization constant $Z$ can be
expressed in terms of the local normalization constants
as follows

$$Z = \frac{\prod_{v \in V} Z_v}{\prod_{e \in E} Z_e}. \tag{5}$$



$^1$There is a generalization of this definition known as region
graphs, see [11]; for simplicity we prefer to work with junction graphs.



Therefore if $\mathcal{G}$ is a tree, there is an efficient algorithm
to compute $Z$, the marginals of $p(\mathbf{x})$, and the entropy
of $p(\mathbf{x})$. If $\mathcal{G}$ is not a tree, the above algorithm is not
guaranteed to give the exact solution or even to converge,
although empirically it performs very well.
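To make the tree case concrete, here is a minimal sketch of one round of message passing on a hypothetical two-vertex junction tree with $L(u) = \{0,1\}$, $L(v) = \{1,2\}$ and $L(e) = \{1\}$. The kernel values are arbitrary, and the messages are left unnormalized so that the local constants can be read off directly.

```python
import itertools

A = (0, 1)  # binary alphabet
a01 = lambda x0, x1: 2.0 if x0 == x1 else 1.0   # kernel at vertex u
a12 = lambda x1, x2: 3.0 if x1 == x2 else 1.0   # kernel at vertex v

# Messages along the single edge: sum out the variables not on the edge.
m_uv = {x1: sum(a01(x0, x1) for x0 in A) for x1 in A}
m_vu = {x1: sum(a12(x1, x2) for x2 in A) for x1 in A}

# Unnormalized beliefs: local kernel times incoming messages.
b_u = {(x0, x1): a01(x0, x1) * m_vu[x1] for x0 in A for x1 in A}
b_v = {(x1, x2): a12(x1, x2) * m_uv[x1] for x1 in A for x2 in A}
b_e = {x1: m_uv[x1] * m_vu[x1] for x1 in A}

# Local normalizing constants.
Z_u, Z_v, Z_e = sum(b_u.values()), sum(b_v.values()), sum(b_e.values())

# Global constant from the local ones, compared with brute force.
Z_tree = Z_u * Z_v / Z_e
Z_brute = sum(a01(x0, x1) * a12(x1, x2)
              for x0, x1, x2 in itertools.product(A, repeat=3))

# The normalized belief at u should equal the exact marginal p(x0, x1).
exact_u = {(x0, x1): sum(a01(x0, x1) * a12(x1, x2) for x2 in A) / Z_brute
           for x0 in A for x1 in A}
```

After a single exchange of messages the normalized beliefs match the exact marginals, and the ratio of local constants recovers $Z$.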

IV. Connection to Statistical Physics 

New theoretical results show that there is a connection
between message passing algorithms and certain approximations
to the energy function in statistical mechanics.
The idea is that having plausible approximations to the
energy function gives hope that the minimizing arguments
are also reasonable approximations to the exact
marginals, see [11], [8]. There are also some new results
regarding the partition function and loop series.

V. Sub-tree Based Lower Bounds on the 
Partition Function 

For a general junction graph, calculating the partition
function $Z$ in a straightforward manner, by summing the
product of the local kernels in (1) over all configurations
of $\mathbf{x}$, needs a sum with an exponential number
of terms ($a^N$ of them). Therefore it is desirable to have bounds on $Z$
which can be obtained with low complexity, see [10].

According to (5), on a junction tree the partition
function can be computed efficiently. In this section we
derive lower bounds on $Z$ which depend on the partition
function of $\mathcal{G}_T$, a sub-junction tree of $\mathcal{G}$, see [5], [6].



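Consider a probabilistic inference problem defined by
$\mathcal{R} = \{R_1, R_2, R_3, \ldots, R_M\}$. Also consider $\mathcal{R}_T$, a subset
of $\mathcal{R}$ that has a junction tree representation. If $q_T(\mathbf{x})$
denotes the global probability distribution and $Z_T$ the
partition function on $\mathcal{G}_T$, we can rewrite $p(\mathbf{x})$
defined in (1) as follows

$$p(\mathbf{x}) = \frac{Z_T}{Z} \, q_T(\mathbf{x}) \prod_{R \in \mathcal{R} \setminus \mathcal{R}_T} \alpha_R(\mathbf{x}_R). \tag{6}$$

Fig. 1. To $p$ via $q_1$ or via $q_2$.

Take the logarithm of both sides of (6), multiply by $q_T(\mathbf{x})$,
and sum over $\mathbf{x}$.

$$\sum_{\mathbf{x}} q_T(\mathbf{x}) \ln p(\mathbf{x}) = \ln\!\left(\frac{Z_T}{Z}\right) + \sum_{\mathbf{x}} q_T(\mathbf{x}) \ln q_T(\mathbf{x}) + \sum_{R \in \mathcal{R} \setminus \mathcal{R}_T} \sum_{\mathbf{x}} q_T(\mathbf{x}) \ln \alpha_R(\mathbf{x}_R). \tag{7}$$

By rearranging (7) we obtain

$$D(q_T(\mathbf{x}) \| p(\mathbf{x})) = \ln\!\left(\frac{Z}{Z_T}\right) - \sum_{R \in \mathcal{R} \setminus \mathcal{R}_T} \sum_{\mathbf{x}} q_T(\mathbf{x}) \ln \alpha_R(\mathbf{x}_R). \tag{8}$$

Since the divergence is non-negative, the following holds

$$\sum_{R \in \mathcal{R} \setminus \mathcal{R}_T} \sum_{\mathbf{x}} q_T(\mathbf{x}) \ln \alpha_R(\mathbf{x}_R) + \ln(Z_T) \le \ln(Z). \tag{9}$$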
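The lower bound on $\ln(Z)$ derived above can be checked numerically. The sketch below assumes a hypothetical single-cycle model with three binary variables and takes the sub-tree formed by the subsets (0,1) and (1,2); all kernel values and helper names are illustrative.

```python
import itertools
import math

# Hypothetical positive kernels on a single cycle.
kernels = {
    (0, 1): lambda a, b: 2.0 if a == b else 1.0,
    (1, 2): lambda a, b: 3.0 if a == b else 1.0,
    (0, 2): lambda a, b: 1.0 if a == b else 2.0,
}
tree = [(0, 1), (1, 2)]                       # subsets kept in the sub-tree
rest = [R for R in kernels if R not in tree]  # subsets left out of it

states = list(itertools.product((0, 1), repeat=3))

def weight(x, subsets):
    """Product of the kernels indexed by `subsets` at configuration x."""
    return math.prod(kernels[R](*(x[i] for i in R)) for R in subsets)

Z = sum(weight(x, kernels) for x in states)   # exact partition function
Z_T = sum(weight(x, tree) for x in states)    # partition function on the sub-tree
p = {x: weight(x, kernels) / Z for x in states}
q_T = {x: weight(x, tree) / Z_T for x in states}

# Lower bound: expected log of the left-out kernels under q_T, plus ln(Z_T).
lower = math.log(Z_T) + sum(q_T[x] * math.log(weight(x, rest)) for x in states)

# Divergence D(q_T || p), which is the gap between ln(Z) and the bound.
D = sum(q_T[x] * math.log(q_T[x] / p[x]) for x in states)
```

The gap `math.log(Z) - lower` equals the divergence `D`, so the bound is tight exactly when $q_T$ coincides with $p$.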

If we denote the lower bound obtained using $q_T$, i.e.
the left hand side of (9), by $L_{q_T}$, the following theorem
holds. See [6, Theorem 2].

Theorem 2: Consider $\mathcal{R}_1$ and $\mathcal{R}_2$, subsets of $\mathcal{R}$ with
junction tree representations. Also suppose that $q_1(\mathbf{x})$
and $q_2(\mathbf{x})$ denote the global probability distributions,
and $Z_1$ and $Z_2$ the partition functions over $\mathcal{R}_1$ and $\mathcal{R}_2$
respectively. Without loss of generality suppose $H(q_1) \le
H(q_2)$, then the following inequality holds

$$L_{q_2} - L_{q_1} \le \min\{ D(q_1 \| \bar{q}_1) - D(q_2 \| q_1),\; D(q_1 \| q_2) + D(q_1 \| \bar{q}_2) \}. \tag{10}$$

Here $\bar{q}_1$ and $\bar{q}_2$ denote the global probability distributions
on $\mathcal{R} \setminus \mathcal{R}_1$ and $\mathcal{R} \setminus \mathcal{R}_2$ respectively.

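Corollary 1: In the case that $\mathcal{R}_1 = \mathcal{R} \setminus \mathcal{R}_2$, namely
when the junction graph decomposes into two junction
trees (for example this can be the case when we choose
a sub-tree in a graph with only one cycle), if $H(q_1) \le
H(q_2)$ the bound in (10) simplifies to

$$L_{q_2} + D(q_2 \| q_1) \le L_{q_1} + D(q_1 \| q_2). \tag{11}$$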
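The simplification can be spelled out as follows. This is a sketch that assumes Theorem 2 reads $L_{q_2} - L_{q_1} \le \min\{D(q_1\|\bar{q}_1) - D(q_2\|q_1),\, D(q_1\|q_2) + D(q_1\|\bar{q}_2)\}$, which is the form used in the proof of Corollary 2. When $\mathcal{R}_1 = \mathcal{R} \setminus \mathcal{R}_2$ we have $\bar{q}_1 = q_2$ and $\bar{q}_2 = q_1$, and $D(q_1 \| q_1) = 0$:

```latex
\begin{align*}
L_{q_2} - L_{q_1}
  &\le \min\{\, D(q_1 \| q_2) - D(q_2 \| q_1),\; D(q_1 \| q_2) + D(q_1 \| q_1) \,\} \\
  &=   \min\{\, D(q_1 \| q_2) - D(q_2 \| q_1),\; D(q_1 \| q_2) \,\} \\
  &=   D(q_1 \| q_2) - D(q_2 \| q_1),
\end{align*}
```

where the last step uses $D(q_2 \| q_1) \ge 0$. Moving $D(q_2 \| q_1)$ and $L_{q_1}$ to the opposite sides gives the inequality of Corollary 1.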



Note that in general $D(\cdot \| \cdot)$ is not symmetric; therefore
(11) does not tell whether one bound is better
than the other. According to (8) and the definition of
$L_{q_T}$, we can see that $L_{q_T} = \ln(Z) - D(q_T \| p)$; therefore
we can rewrite (11) as

$$D(q_2 \| q_1) + D(q_1 \| p) \le D(q_1 \| q_2) + D(q_2 \| p). \tag{12}$$

In other words, the distance from $q_2$ to $p$ via $q_1$ is
shorter than the distance from $q_1$ to $p$ via $q_2$. See Fig. 1.
Note that in general $D(\cdot \| \cdot)$ does not satisfy the triangle
inequality; therefore (12) does not tell whether
one bound is better than the other either.

Clearly, since $D(q_2 \| q_1) \ge 0$, the distance from $q_1$ to $p$
is also shorter than the distance from $q_1$ to $p$ via $q_2$.

$$D(q_1 \| p) \le D(q_1 \| q_2) + D(q_2 \| p). \tag{13}$$
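The divergence inequalities for the two-tree case can be checked on a small example. The sketch below assumes a hypothetical single-cycle model that splits into the sub-tree with subsets (0,1), (1,2) and the complementary sub-tree with the single subset (0,2); the lower bounds are computed through the identity $L_{q_T} = \ln(Z) - D(q_T \| p)$ stated in the text.

```python
import itertools
import math

# Hypothetical positive kernels on a single cycle.
kernels = {
    (0, 1): lambda a, b: 2.0 if a == b else 1.0,
    (1, 2): lambda a, b: 3.0 if a == b else 1.0,
    (0, 2): lambda a, b: 1.0 if a == b else 2.0,
}
states = list(itertools.product((0, 1), repeat=3))

def distribution(subsets):
    """Normalized product of the kernels in `subsets`, plus its partition function."""
    w = {x: math.prod(kernels[R](*(x[i] for i in R)) for R in subsets) for x in states}
    total = sum(w.values())
    return {x: v / total for x, v in w.items()}, total

def kl(q, r):
    """Divergence D(q || r) in nats."""
    return sum(q[x] * math.log(q[x] / r[x]) for x in states if q[x] > 0)

def entropy(q):
    return -sum(v * math.log(v) for v in q.values() if v > 0)

p, Z = distribution(list(kernels))
q_a, _ = distribution([(0, 1), (1, 2)])
q_b, _ = distribution([(0, 2)])

# Label the lower-entropy distribution q1, as assumed in Theorem 2.
q1, q2 = (q_a, q_b) if entropy(q_a) <= entropy(q_b) else (q_b, q_a)

L1 = math.log(Z) - kl(q1, p)
L2 = math.log(Z) - kl(q2, p)

tol = 1e-12
ineq_11 = L2 + kl(q2, q1) <= L1 + kl(q1, q2) + tol
ineq_12 = kl(q2, q1) + kl(q1, p) <= kl(q1, q2) + kl(q2, p) + tol
ineq_13 = kl(q1, p) <= kl(q1, q2) + kl(q2, p) + tol
```

All three flags evaluate to true on this example; the check is only a sanity test on one instance, not a proof.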



Corollary 2: Consider a subset $\mathcal{R}_S$ of $\mathcal{R}$ with junction
tree representation. Also suppose that $q_S$, the probability
distribution over $\mathcal{R}_S$, has the smallest entropy among all
the probability distributions on sub-trees. Then for any
subset $\mathcal{R}_T$ of $\mathcal{R}$ with tree representation the following
inequality holds

$$L_{q_T} \le L_{q_S} + D(q_S \| \bar{q}_S). \tag{14}$$



Proof: According to Theorem 2

$$L_{q_T} - L_{q_S} \le \min\{ D(q_S \| \bar{q}_S) - D(q_T \| q_S),\; D(q_S \| q_T) + D(q_S \| \bar{q}_T) \},$$

and hence the following

$$L_{q_T} \le L_{q_S} + D(q_S \| \bar{q}_S) - D(q_T \| q_S) \le L_{q_S} + D(q_S \| \bar{q}_S).$$

In other words, the lower bound obtained from any
sub-tree cannot be better than the lower bound obtained
from the minimum entropy sub-tree by more than
$D(q_S \| \bar{q}_S)$, a value that does not depend on $q_T$. This
gives us a quality guarantee for the lower bound obtained
from the minimum entropy sub-tree.

Theorem 3: Consider subsets $\mathcal{R}_S$ and $\mathcal{R}_B$ of $\mathcal{R}$ with
junction tree representations. Also suppose that $q_S$, the
probability distribution over $\mathcal{R}_S$, has the smallest entropy
and $q_B$, the probability distribution over $\mathcal{R}_B$, gives the
best lower bound, then the following inequality holds

$$D(q_B \| q_S) \le D(q_S \| \bar{q}_S). \tag{15}$$



Proof: Since $H(q_S) \le H(q_B)$, according to Theorem 2
we can write

$$L_{q_B} \le L_{q_S} + D(q_S \| \bar{q}_S) - D(q_B \| q_S). \tag{16}$$

Since $q_B$ gives the best lower bound

$$L_{q_S} \le L_{q_B}. \tag{17}$$

The result follows by adding (16) and (17).
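For completeness, the final step of the proof can be written out. This sketch takes (16) to be $L_{q_B} \le L_{q_S} + D(q_S \| \bar{q}_S) - D(q_B \| q_S)$ and (17) to be $L_{q_S} \le L_{q_B}$, and adds the two inequalities side by side:

```latex
\begin{align*}
L_{q_B} + L_{q_S}
  &\le L_{q_S} + D(q_S \| \bar{q}_S) - D(q_B \| q_S) + L_{q_B} \\
\Longrightarrow\quad
0 &\le D(q_S \| \bar{q}_S) - D(q_B \| q_S) \\
\Longrightarrow\quad
D(q_B \| q_S) &\le D(q_S \| \bar{q}_S).
\end{align*}
```

The lower bounds $L_{q_B}$ and $L_{q_S}$ cancel, which is why the resulting inequality involves divergences only.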

Theorem 3 gives us another quality guarantee regarding
the minimum entropy sub-tree (which is the least
random, least uncertain, and most biased sub-tree). This
theorem shows that the probability distribution on the
tree that gives the best bound and the minimum entropy
distribution are close, where closeness is measured by
divergence. The upper bound is $D(q_S \| \bar{q}_S)$, which does
not depend on $q_B$. See Fig. 2.

Corollary 3: In the case that $\mathcal{R}_S = \mathcal{R} \setminus \mathcal{R}_B$ we have
the following inequality

$$D(q_B \| q_S) \le D(q_S \| q_B). \tag{18}$$

Remark 1: In almost all the theorems and corollaries,
we insisted that the subsets of $\mathcal{R}$ have junction-tree
representations. This assumption can be relaxed and the
theorems and corollaries would still be valid for the
subgraphs. However, having a junction tree representation
makes the computation of the entropy and the partition
function easier (using GDL or any other iterative
message passing algorithm).

VI. Conclusion 

In this paper, we extended some of our previous results
on bounding the partition function. In the case that
the graph decomposes into two sub-trees we derived a
number of divergence inequalities concerning the global
probability distribution and the probability distributions
on the sub-trees. We showed that the minimum entropy
sub-tree has some optimality properties: the
lower bound obtained from this tree cannot be far
from the lower bound obtained from any other sub-tree,
and the probability distribution on this tree and
the probability distribution on the tree that gives the
best lower bound are close, where the divergence is the
measure of closeness.

Fig. 2. Upper bound for divergence between $q_B$ and $q_S$.